HugAgent: Benchmarking LLMs for Simulation of Individualized Human Reasoning
Li, Chance Jiajie, Mo, Zhenze, Tang, Yuhan, Qu, Ao, Wu, Jiayi, Zhao, Kaiya Ivy, Gan, Yulu, Fan, Jie, Yu, Jiangbo, Jiang, Hang, Liang, Paul Pu, Zhao, Jinhua, Pastor, Luis Alberto Alonso, Larson, Kent
Simulating human reasoning in open-ended tasks has long been a central aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), which rethinks human reasoning simulation along three dimensions: (i) from averaged to individualized reasoning, (ii) from behavioral mimicry to cognitive alignment, and (iii) from vignette-based to open-ended data. The benchmark evaluates whether a model can predict a specific person's behavioral responses and the underlying reasoning dynamics in out-of-distribution scenarios, given partial evidence of their prior views. HugAgent adopts a dual-track design: a human track that automates and scales the think-aloud method to collect ecologically valid human reasoning data, and a synthetic track for further scalability and systematic stress testing. This architecture enables low-cost, extensible expansion to new tasks and populations. Experiments with state-of-the-art language models reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. The benchmark, along with its complete data collection pipeline and companion chatbot, is open-sourced as HugAgent (https://anonymous.4open.science/r/HugAgent) and TraceYourThinking (https://anonymous.4open.science/r/trace-your-thinking).
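To make the task concrete: a HugAgent-style item can be pictured as partial evidence of one person's prior views plus a held-out, out-of-distribution scenario, scored on both the behavioral answer and the reasoning trace. The sketch below is a minimal illustration under assumed field names (prior_statements, true_reasoning_steps, and so on), not the benchmark's actual schema or metrics.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    """One hypothetical HugAgent-style item: evidence of a person's prior
    views plus a held-out, out-of-distribution scenario to predict."""
    prior_statements: list[str]      # partial evidence of earlier views
    ood_scenario: str                # novel situation posed to the model
    true_response: str               # the person's actual answer
    true_reasoning_steps: list[str]  # think-aloud trace (cognitive track)

def score_item(item: BenchmarkItem, predicted_response: str,
               predicted_steps: list[str]) -> dict:
    """Score exact behavioral match plus a crude reasoning-overlap proxy."""
    behavioral = float(predicted_response.strip().lower()
                       == item.true_response.strip().lower())
    gold = {s.lower() for s in item.true_reasoning_steps}
    pred = {s.lower() for s in predicted_steps}
    overlap = len(gold & pred) / len(gold) if gold else 0.0
    return {"behavioral_match": behavioral, "reasoning_overlap": overlap}
```

The dual-track design would then populate such items either from think-aloud interviews (human track) or from generated personas (synthetic track).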
Investigating Lexical Change through Cross-Linguistic Colexification Patterns
Gfeller, Kim, Stoll, Sabine, Cathcart, Chundra, Widmer, Paul
One of the most intriguing features of language is its constant change, with ongoing shifts in how meaning is expressed. Despite decades of research, the factors that determine how and why meanings evolve remain only partly understood. Colexification -- the phenomenon of expressing multiple distinct concepts using the same word form -- serves as a valuable window onto the dynamics of meaning change across languages. Here, we apply phylogenetic comparative models to dictionary data from three language families, Austronesian, Indo-European, and Uralic, in order to shed light on the evolutionary dynamics underlying the colexification of concept pairs. We assess the effects of three predictors: associativity, borrowability, and usage frequency. Our results show that more closely related concept pairs are colexified across a larger portion of the family tree and exhibit slower rates of change. In contrast, concept pairs that are more frequent and more prone to borrowing tend to change more rapidly and are less often colexified. We also find considerable differences between the language families under study, suggesting that areal and cultural factors may play a role.
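As a rough, non-phylogenetic illustration of the qualitative pattern reported here, one can simulate concept pairs whose odds of colexification rise with associativity and fall with frequency and borrowability, then check the sign of the raw correlations. This is a toy sketch on simulated data; the actual analysis fits phylogenetic comparative models over language family trees.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 500

# Hypothetical per-pair predictor values on a 0-1 scale.
associativity = rng.uniform(0, 1, n_pairs)
frequency = rng.uniform(0, 1, n_pairs)
borrowability = rng.uniform(0, 1, n_pairs)

# Simulate the reported direction of effects: associativity raises the
# odds of colexification; frequency and borrowability lower them.
logits = 2.5 * associativity - 1.5 * frequency - 1.0 * borrowability - 0.5
colexified = rng.random(n_pairs) < 1.0 / (1.0 + np.exp(-logits))

for name, x in [("associativity", associativity),
                ("frequency", frequency),
                ("borrowability", borrowability)]:
    r = np.corrcoef(x, colexified.astype(float))[0, 1]
    print(f"{name:>13s}: r = {r:+.2f}")  # expect +, -, - respectively
```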
Analysis of Threat-Based Manipulation in Large Language Models: A Dual Perspective on Vulnerabilities and Performance Enhancement Opportunities
Large Language Models (LLMs) demonstrate complex responses to threat-based manipulations, revealing both vulnerabilities and unexpected performance enhancement opportunities. This study presents a comprehensive analysis of 3,390 experimental responses from three major LLMs (Claude, GPT-4, Gemini) across 10 task domains under 6 threat conditions. We introduce a novel threat taxonomy and multi-metric evaluation framework to quantify both negative manipulation effects and positive performance improvements. Results reveal systematic vulnerabilities, with policy evaluation showing the highest metric significance rates under role-based threats, alongside substantial performance enhancements in numerous cases with effect sizes up to +1336%. Statistical analysis indicates systematic certainty manipulation (pFDR < 0.0001) and significant improvements in analytical depth and response quality. These findings have dual implications for AI safety and practical prompt engineering in high-stakes applications.
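The pFDR values quoted above are false-discovery-rate-adjusted p-values; assuming the standard Benjamini-Hochberg adjustment (the usual reading of that notation), the correction can be computed in a few lines:

```python
import numpy as np

def benjamini_hochberg(pvals) -> np.ndarray:
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working back from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty_like(scaled)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

print(benjamini_hochberg([0.001, 0.01, 0.02, 0.4]))
# -> [0.004  0.02  0.0267  0.4] (approximately)
```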
Causal Explanations Over Time: Articulated Reasoning for Interactive Environments
Rödling, Sebastian, Zečević, Matej, Dhami, Devendra Singh, Kersting, Kristian
Structural Causal Explanations (SCEs) can be used to automatically generate natural-language explanations, grounded in a (possibly learned) causal model, for questions about given data. Unfortunately, they work only for small data, and they are ill-suited to offering reasons for events that unfold over time, e.g., tracking causal changes across multiple time steps, or that involve a behavioral component with feedback loops through an agent's actions. To this end, we generalize SCEs to a (recursive) formulation of explanation trees that captures the temporal interactions between reasons. We show the benefits of this more general SCE algorithm on synthetic time-series data and a 2D grid game, and further compare it to the base SCE and other existing methods for causal explanations.
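One minimal way to picture the generalization is an explanation tree whose nodes are events at particular time steps, each explained recursively by earlier causes. The sketch below uses hypothetical field names and is far simpler than the recursive SCE formulation in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class ExplanationNode:
    """One reason in a temporal explanation tree: an event at a time
    step, explained recursively by its causal parents at earlier steps."""
    variable: str
    time_step: int
    value: float
    causes: list["ExplanationNode"] = field(default_factory=list)

def render(node: ExplanationNode, depth: int = 0) -> str:
    """Flatten the tree into a nested, explanation-like printout."""
    line = "  " * depth + f"{node.variable}(t={node.time_step}) = {node.value}"
    return "\n".join([line] + [render(c, depth + 1) for c in node.causes])

# Toy chain with a behavioral component: an agent's action at t=1 is
# explained by an observation at t=0.
obs = ExplanationNode("obstacle_ahead", 0, 1.0)
act = ExplanationNode("turned_left", 1, 1.0, causes=[obs])
print(render(act))
```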
Revealed: The formula for the perfect day - including a short shift at WORK
In the search for happiness, having a good day every day is surely crucial. But when there are so many pursuits competing for our attention, sometimes it's difficult to know how much time to allocate to each one. Now, scientists in Canada claim to have cracked the code for the perfect day – and surprisingly, it includes a short shift at work. According to the experts, the formula for the perfect day is six hours of family time, two hours spent with friends, 1.5 hours socialising, two hours exercising and one hour eating and drinking. Additionally, the perfect day should involve no more than six hours of work and less than 15 minutes of commuting.
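Taking the work and commuting figures at their stated maxima, the schedule can be sanity-checked with simple arithmetic: the listed activities already claim 18.75 of 24 hours, leaving about 5.25 hours for sleep and everything else. A quick tally:

```python
# Hours in the article's "perfect day", with work and commuting
# taken at their stated upper bounds.
perfect_day_hours = {
    "family": 6.0,
    "friends": 2.0,
    "socialising": 1.5,
    "exercise": 2.0,
    "eating and drinking": 1.0,
    "work (at most)": 6.0,
    "commute (at most)": 0.25,
}
total = sum(perfect_day_hours.values())
print(f"scheduled: {total:.2f} h, remaining: {24 - total:.2f} h")
# scheduled: 18.75 h, remaining: 5.25 h (sleep not accounted for)
```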
Pandora's Box or Aladdin's Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models
Wu, Jinyang, Che, Feihu, Zhang, Chuyuan, Tao, Jianhua, Zhang, Shuai, Shao, Pengpeng
Retrieval-Augmented Generation (RAG) has emerged as a crucial method for addressing hallucinations in large language models (LLMs). While recent research has extended RAG models to complex noisy scenarios, these explorations often confine themselves to limited noise types and presuppose that noise is inherently detrimental to LLMs, potentially deviating from real-world retrieval environments and restricting practical applicability. In this paper, we define seven distinct noise types from a linguistic perspective and establish a Noise RAG Benchmark (NoiserBench), a comprehensive evaluation framework encompassing multiple datasets and reasoning tasks. Through empirical evaluation of eight representative LLMs with diverse architectures and scales, we reveal that these noises can be further categorized into two practical groups: noise that is beneficial to LLMs (aka beneficial noise) and noise that is harmful to LLMs (aka harmful noise). While harmful noise generally impairs performance, beneficial noise may enhance several aspects of model capabilities and overall performance. Our analysis offers insights for developing more robust, adaptable RAG solutions and mitigating hallucinations across diverse retrieval scenarios.
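Although the seven noise types are defined linguistically, the mechanics of such a benchmark can be sketched as a function that perturbs retrieved passages before they reach the LLM. The injector below is a hypothetical illustration with two made-up noise kinds, not NoiserBench's actual taxonomy.

```python
import random

def inject_noise(passages: list[str], noise_type: str,
                 distractor_pool: list[str], seed: int = 0) -> list[str]:
    """Perturb retrieved passages before prompting; illustrative only."""
    rng = random.Random(seed)
    noisy = list(passages)
    if noise_type == "irrelevant":
        # Mix in an off-topic passage (noise that may help or hurt).
        noisy.insert(rng.randrange(len(noisy) + 1),
                     rng.choice(distractor_pool))
    elif noise_type == "orthographic":
        # Surface-level corruption: swap two adjacent characters.
        i = rng.randrange(len(noisy))
        chars = list(noisy[i])
        if len(chars) > 1:
            j = rng.randrange(len(chars) - 1)
            chars[j], chars[j + 1] = chars[j + 1], chars[j]
            noisy[i] = "".join(chars)
    return noisy

print(inject_noise(["The Eiffel Tower is in Paris."], "orthographic",
                   ["Bananas are botanically berries."]))
```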
Correlation Does Not Imply Compensation: Complexity and Irregularity in the Lexicon
Doucette, Amanda, Cotterell, Ryan, Sonderegger, Morgan, O'Donnell, Timothy J.
It has been claimed that within a language, morphologically irregular words are more likely to be phonotactically simple and morphologically regular words are more likely to be phonotactically complex. This inverse correlation has been demonstrated in English for a small sample of words, but has yet to be shown for a larger sample of languages. Furthermore, frequency and word length are known to influence both phonotactic complexity and morphological irregularity, and they may be confounding factors in this relationship. Therefore, we examine the relationships between all pairs of these four variables both to assess the robustness of previous findings using improved methodology and as a step towards understanding the underlying causal relationship. Using information-theoretic measures of phonotactic complexity and morphological irregularity (Pimentel et al., 2020; Wu et al., 2019) on 25 languages from UniMorph, we find that there is evidence of a positive relationship between morphological irregularity and phonotactic complexity within languages on average, although the direction varies within individual languages. We also find weak evidence of a negative relationship between word length and morphological irregularity that had not been previously identified, and that some existing findings about the relationships between these four variables are not as robust as previously thought.
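The pairwise analysis amounts to rank correlations among four word-level variables; the toy sketch below runs it on simulated stand-ins (the study itself estimates complexity and irregularity information-theoretically from UniMorph data across 25 languages, with more careful statistics).

```python
import itertools
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_words = 1000

# Simulated stand-ins for the four per-word variables.
length = rng.integers(2, 12, n_words).astype(float)
frequency = rng.zipf(2.0, n_words).astype(float)
phonotactic_complexity = 0.3 * length + rng.normal(0, 1, n_words)
irregularity = (0.2 * phonotactic_complexity - 0.1 * length
                + rng.normal(0, 1, n_words))

variables = {
    "length": length,
    "frequency": frequency,
    "phonotactic complexity": phonotactic_complexity,
    "irregularity": irregularity,
}
for (na, xa), (nb, xb) in itertools.combinations(variables.items(), 2):
    rho, p = spearmanr(xa, xb)
    print(f"{na} ~ {nb}: rho = {rho:+.2f} (p = {p:.1e})")
```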
One vs. Many: Comprehending Accurate Information from Multiple Erroneous and Inconsistent AI Generations
Lee, Yoonjoo, Son, Kihoon, Kim, Tae Soo, Kim, Jisu, Chung, John Joon Young, Adar, Eytan, Kim, Juho
As Large Language Models (LLMs) are nondeterministic, the same input can generate different outputs, some of which may be incorrect or hallucinated. If run again, the LLM may correct itself and produce the correct answer. Unfortunately, most LLM-powered systems present a single result, which users accept whether or not it is correct. Having the LLM produce multiple outputs may help identify disagreements or alternatives. However, it is not obvious how the user will interpret conflicts or inconsistencies. To this end, we investigate how users perceive the AI model and comprehend the generated information when they receive multiple, potentially inconsistent, outputs. Through a preliminary study, we identified five types of output inconsistencies. Based on these categories, we conducted a study (N=252) in which participants were given one or more LLM-generated passages in response to an information-seeking question. We found that inconsistency within multiple LLM-generated outputs lowered the participants' perceived AI capacity, while also increasing their comprehension of the given information. Specifically, we observed that this positive effect of inconsistencies was most significant for participants who read two passages, compared to those who read three. Based on these findings, we present design implications: instead of regarding LLM output inconsistencies as a drawback, systems can surface them to transparently indicate the limitations of these models and promote critical LLM usage.
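Studying multiple generations presupposes a way to collect and compare them for the same query; a minimal sketch of grouping repeated outputs and flagging inconsistency (a hypothetical helper, not the interface the authors built) might look like this:

```python
from collections import Counter

def summarize_generations(outputs: list[str]) -> dict:
    """Group repeated LLM generations and flag inconsistency."""
    counts = Counter(o.strip() for o in outputs)
    majority, support = counts.most_common(1)[0]
    return {
        "distinct_answers": len(counts),
        "majority_answer": majority,
        "agreement": round(support / len(outputs), 2),
        "inconsistent": len(counts) > 1,
    }

print(summarize_generations(["Paris", "Paris", "Lyon"]))
# {'distinct_answers': 2, 'majority_answer': 'Paris',
#  'agreement': 0.67, 'inconsistent': True}
```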
Scientists reveal how long YOU should walk to boost brain power
Facebook founder Mark Zuckerberg reportedly loves conducting meetings while walking, and so did Apple co-founder Steve Jobs - and scientists have shown that they were right on target. Just 20 minutes of walking can prepare the brain to take in and retain new information, neuroscience research has shown. These positive effects can be seen in areas of the brain involved in making decisions, managing stress, and planning our behavior. Other forms of exercise have their own benefits for brain health, too, but this research determined that it doesn't take much to boost your brain power - and a little bit of walking is much better than no exercise at all.
Open-Source Large Language Models Outperform Crowd Workers and Approach ChatGPT in Text-Annotation Tasks
Alizadeh, Meysam, Kubli, Maël, Samei, Zeynab, Dehghani, Shirin, Bermeo, Juan Diego, Korobeynikova, Maria, Gilardi, Fabrizio
For instance, studies demonstrate that ChatGPT exceeds the performance of crowd workers in tasks encompassing relevance, stance, sentiment, topic identification, and frame detection (Gilardi, Alizadeh and Kubli, 2023), that it outperforms trained annotators in detecting the political party affiliations of Twitter users (Törnberg, 2023), and that it achieves accuracy scores over 0.6 for tasks such as stance, sentiment, hate speech detection, and bot identification (Zhu et al., 2023). Notably, ChatGPT also correctly classifies more than 70% of news items as either true or false (Hoes, Altay and Bermeo, 2023), which suggests that LLMs might be used to assist content moderation processes. While the performance of LLMs for text annotation is promising, several aspects remain unclear and require further research. Among these are the impact of different approaches, such as zero-shot versus few-shot learning, and of settings such as varying temperature parameters. Zero-shot learning allows models to make predictions on tasks they have not been trained for, while few-shot learning uses a small number of examples to generalize to new tasks. The conditions under which one approach outperforms the other are not yet fully understood.
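The zero-shot versus few-shot distinction comes down to whether labeled examples are placed in the prompt. The hypothetical prompt builder below (assumed names and wording, not the prompts the authors used) illustrates both settings; the resulting prompt would then be sent to the model at different temperature settings (e.g., 0.0 versus 1.0) to probe the second question.

```python
def build_annotation_prompt(text: str, labels: list[str],
                            examples: list[tuple[str, str]] | None = None) -> str:
    """Zero-shot when `examples` is None; few-shot otherwise."""
    lines = [f"Classify the sentiment of the text as one of: {', '.join(labels)}."]
    for ex_text, ex_label in examples or []:
        lines.append(f"Text: {ex_text}\nLabel: {ex_label}")
    lines.append(f"Text: {text}\nLabel:")
    return "\n\n".join(lines)

zero_shot = build_annotation_prompt(
    "I love this update.", ["positive", "negative"])
few_shot = build_annotation_prompt(
    "I love this update.", ["positive", "negative"],
    examples=[("This rollout was a mess.", "negative")],
)
print(zero_shot, few_shot, sep="\n\n---\n\n")
```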